Day 2 Session 1:
Dynamic spectral analysis
using Functional Principal Component Analysis (FPCA)
Introduction
Yesterday, we looked at two statistical analysis methods: linear mixed-effect models (LMEs) and generalised additive mixed-effect models (GAMMs). These techniques can, in a way, be considered as top-down approaches to both static (LMEs) and dynamic analyses (GAMMs) given that predictor variables need to be specified beforehand. These models are useful when you have some knowledge about specific variables of interest.
In contrast, our studies can often be exploratory, in which we do not have much prior knowledge regarding e.g., what predictor variables influence the outcome variables, how predictor variables interact with each other, etc. In other words, we would like to see what structures emerge from the data.
Today, we will take a brief look at two bottom-up, exploratory approaches to data analysis: principal component analysis (PCA) and functional principal component analysis (FPCA). These tools can be useful when (1) reducing data dimensions into a manageable number of variables and (2) identifying the degrees to which each parameter contributes to the data structure.
To contexualise today’s workshop, we actually know already that the lower three formants (F1, F2 and F3) are important dimensions to understand L1 Japanese speakers’ production of L2 English /l/ and /ɹ/, so it might not be that interesting if we stick to these. For this reason, let’s extend our scope of data analysis and take duration into account.
Below, similarly to yesterday, we’ll start with the static data analysis using PCA. FPCA is an extension of PCA, so you’d find it easier to understand FPCA once you’ve seen PCA first.
Principal Component Analysis (PCA)
Let’s apply Principal Component Analysis (PCA) onto the static data.
Preliminaries
Checking data and data wrangling
As usual, let’s check what variables are included in the data set
using colnames().
## [1] "...1" "file" "speaker" "language"
## [5] "duration" "segment" "word" "f1"
## [9] "f2" "f3" "previous_sound" "next_sound"
## [13] "percent" "IsApproximant" "IsAcoustic" "gender"
## [17] "omit" "position"
Also remove irrelevant columns and rename the variables in the
vowel column as has been done previously.
# Remove columns
df_mid <- df_mid |>
dplyr::select(-c(IsApproximant, IsAcoustic, omit))
# Rename vowel variables
df_mid <- df_mid |>
dplyr::mutate(
vowel = case_when(
next_sound == "AE1" ~ "/æ/",
next_sound == "IY1" ~ "/i/",
next_sound == "UW1" ~ "/u/",
)
)
# z-normalise formant values for cross-speaker comparison
df_mid <- df_mid |>
dplyr::group_by(speaker) |>
dplyr::mutate(
f1z = scale(f1),
f2z = scale(f2),
f3z = scale(f3)
) |>
dplyr::ungroup()Data visualisation
Let’s try data visualisation. First, we’ll compare between-group
difference using a combination of scatter, box and violin plots. In
order to plot everything altogether, we’ll convert the data set into
long data using the tidyr::pivot_longer() function.
Spectral characteristics
Violin plot
# convert the data set into a long data
df_mid_long <- df_mid |>
dplyr::select(-c(f1, f2, f3)) |>
tidyr::pivot_longer(14:16, names_to = "formant", values_to = "values")
# plot
df_mid_long |>
ggplot(aes(x = language, y = values)) +
geom_jitter(aes(colour = language), alpha = 0.5) +
geom_violin(alpha = 0.4) +
geom_boxplot(width = 0.3, alpha = 0.4) +
facet_grid(vowel ~ formant) +
scale_colour_manual(values = cbPalette) +
theme(
strip.text.y = element_text(angle = 360)
)Correlation
Let’s also take a slightly different approach. The objective of PCA is to draw straight lines along the dimension that shows the greatest variance. To inspect the variance in the data, we’ll plot the relationships between the three spectral dimensions.
# F1 and F2
df_mid |>
ggplot(aes(x = f1z, y = f2z)) +
geom_point(aes(colour = segment), alpha = 0.4) +
scale_colour_manual(values = cbPalette) +
facet_wrap(~ language)# F2 and F3
df_mid |>
ggplot(aes(x = f2z, y = f3z)) +
geom_point(aes(colour = segment), alpha = 0.4) +
scale_colour_manual(values = cbPalette) +
facet_wrap(~ language)# F1 and F3
df_mid |>
ggplot(aes(x = f1z, y = f3z)) +
geom_point(aes(colour = segment), alpha = 0.4) +
scale_colour_manual(values = cbPalette) +
facet_wrap(~ language)As expected, English /l/ and /ɹ/ can be reliably distinguished along the F3 dimension, such that English /ɹ/ shows lower F3 than English /l/.
Temporal characteristics
OK, then how about duration? Let’s visualise them first,
and then check how it correlates with each of the spectral parameters.
We’ll use the original df_mid data frame again but convert
the duration into millisecond.
Violin plot
# convert duration (s) into (ms)
df_mid <- df_mid |>
dplyr::mutate(
duration_ms = duration * 1000,
duration_ms_z = scale(duration_ms)
)
# plot
df_mid |>
ggplot(aes(x = language, y = duration_ms_z)) +
geom_jitter(aes(colour = language), alpha = 0.5) +
geom_violin(alpha = 0.4) +
geom_boxplot(width = 0.3, alpha = 0.4) +
facet_grid(~ segment) +
scale_colour_manual(values = cbPalette) +
theme(
strip.text.y = element_text(angle = 360)
)Correlation
Let’s also check whether duration correlates with any of the spectral measures.
# with F1
df_mid |>
ggplot(aes(x = f1z, y = duration_ms_z)) +
geom_point(aes(colour = language), alpha = 0.4) +
scale_colour_manual(values = cbPalette) +
facet_wrap(segment ~ language)# with F2
df_mid |>
ggplot(aes(x = f2z, y = duration_ms_z)) +
geom_point(aes(colour = language), alpha = 0.4) +
scale_colour_manual(values = cbPalette) +
facet_wrap(segment ~ language)# with F3
df_mid |>
ggplot(aes(x = f3z, y = duration_ms_z)) +
geom_point(aes(colour = language), alpha = 0.4) +
scale_colour_manual(values = cbPalette) +
facet_wrap(segment ~ language)Mmm, we do not seem to find particularly interesting correlations here. But this is fine for now, as we’re just exploring this. The key takeaway here is that we always need multiple plots in order to obtain a holistic understanding of the data structure, which is somewhat cumbersome. Here, PCA comes in handy, as it is a dimension reduction technique and boils down the variance in the data into a handful of principal components.
Applying PCA
Let’s apply PCA on the formant data. We’ll classify the tokens based
on the z-scored F1, F2 and F3. We use the prcomp() function
to perform PCA.
# rearrange column order for easier data subsetting
df_mid <- df_mid |>
dplyr::relocate(f1z, f2z, f3z, duration_ms_z)
# check colnames
colnames(df_mid)## [1] "f1z" "f2z" "f3z" "duration_ms_z"
## [5] "...1" "file" "speaker" "language"
## [9] "duration" "segment" "word" "f1"
## [13] "f2" "f3" "previous_sound" "next_sound"
## [17] "percent" "gender" "position" "vowel"
## [21] "duration_ms"
# PCA on F1z, F2z, F3z and duration
pca_mid <- prcomp(select(df_mid, 1:4))
# check summary
summary(pca_mid)## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.1365 1.0302 0.9856 0.7863
## Proportion of Variance 0.3276 0.2692 0.2464 0.1568
## Cumulative Proportion 0.3276 0.5968 0.8432 1.0000
The summary shows three rows:
Standard deviation: This fundamentally shows the variance shown in the data. The greater the value, the more variance is captured by each PC.
Proportion of Variance: This expresses the amount of variance explained by each PC in a proportional manner. This can be calculated with the following formulae:
\[ pca\_var\_exp = \frac{(\text{sdev})^2}{\sum pca\_var} \]
- Cumulative Proportion: This simply shows the sum of the proportion of variance explained by each PC. For instance, PC1 explains 43.37% of variance whereas PC2 33.68%. The cumulative proportion of PC1 and PC2 is thus 77.05% (i.e., 43.37% + 33.68%).
Another important index in PCA is eigenvalues. They represent how much variance is explained by each PC. This sounds similar to some of the values we already saw: standard deviation and (cumulative) proportion. Eigenvalues are computed based on starndard deviation, and then prooprtion of variance is calculated based on eigenvalues.
## [1] 1.2916551 1.0612984 0.9714709 0.6183089
# calculating proportion of variance
pca_var_exp <- eigenvalues / sum(eigenvalues) # Proportion of variance
pca_var_exp## [1] 0.3276040 0.2691783 0.2463953 0.1568224
PCA summary
Let’s also look into the inside of pca_mid.
## Standard deviations (1, .., p=4):
## [1] 1.1365100 1.0301934 0.9856322 0.7863262
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## f1z 0.06228593 0.54478246 0.8043986 0.2286382
## f2z 0.64132789 0.26005576 -0.3971193 0.6027983
## f3z 0.72630007 -0.01361938 0.1439881 -0.6719897
## duration_ms_z 0.23938924 -0.79711830 0.4177398 0.3644018
The top line shows standard deviation, which we just saw earlier, so let’s skip this for now.
At the bottom, we have loadings matrix. This shows eigenvectors: i.e., the amount and (2) the direction of the contribution that the original data dimensions make to each PC. Let’s disentangle this one by one:
PC1: Contributions of
f1z,f2z,f3zanddurationare all in the same direction (i.e., positive). But the amount of contribution is fairly small forf1zanddurationcompared tof2zandf3z.PC2: In contrast to PC1,
f1zcontributes to PC2 to a greater extent thanf2zandf3z.durationalso seems to make a big contribution here but in the opposite direction (i.e., negative).PC3: The contribution of
durationseems quite big here, too.PC4:
f2zandf3zboth seem to contribute to this dimension, although they are in antagonistic direction.
So, all in all, PCA might suggest:
PC1 captures a covariation between
f2zandf3z, which may correspond to the spectral difference between /l/ and /ɹ/.PC2 mainly captures a covariation of
durationandf1zbut in opposite directions.PC3 shows variation in
f1z.PC4 again seems to capture
f2zandf3zbut in opposite directions.
The interpretation may be better facilitated by data visualisation. Let’s try two main approaches to data visualiastion of PCA.
Data visualisation 1: Scree plot
The amount of variance is useful when determining which PC to retain for analysis. As a rule of thumb, Bayeen (2008) recommends that we retain PCs that explains greater than 5% of variance in the data.
A useful visualisation of the proportion of variance is scree
plot. There is a default function screeplot() that
lets you create a scree plot quite easily:
You can also create a scree plot using ggplot() by
manually calculating the proportion of variance based on the standard
deviation (see above for the formula). This is useful when you need more
flexibility – e.g., to show the 5% thereshold suggested by Bayeen
(2008).
## obtaining proportion of variance
pca_var_exp <- pca_mid$sdev^2 / sum(eigenvalues)
# making var_explained as a tibble
pca_var_exp <- tibble::as_tibble(pca_var_exp)
# adding column name
pca_var_exp <- pca_var_exp |>
tibble::as_tibble() |>
dplyr::mutate(
PC = row_number()
)
# create a plot
scree_plot <- pca_var_exp |>
ggplot(aes(x = PC, y = value)) +
geom_line() +
geom_text(aes(label = round(value, digits = 3)*100), nudge_x = 0.2) +
geom_label(aes(label = PC), label.padding = unit(0.40, "lines")) +
geom_hline(yintercept = 0.05, linetype = 'dotted') +
scale_x_continuous(breaks = c(1, 2, 3)) +
labs(x = "Principal Component", y = "Variance Explained", title = "Proportion of Variance explained by each PC")
# showing plot
scree_plot
Since there are only three PCs identified from the data, it doesn’t make
much sense to visualise them. But this will be useful when
prcomp() identifies more than a few PCs.
creating biplot
The interpretation of loading matrix was more or less straightforward in this case given that only three parameters were involved. It is easy to see, however, that this can get quite complicated when the original data have a greater number of dimensions.
Biplots help us better understand the loadings and
PC scores for each observation. For example, the default function
biplot() plots all observations along the PC1 and PC2
dimensions with eigenvectors shown in red arrows (i.e.,
the direction and the amount of weighting: also called
loadings).
Detailed biplots
Similarly to the scree plot, we could also create biplots manually
using ggplot(). This will help us better understand what
each PC could stand for in the current data set.
# Extract PCA scores (principal component scores for each observation)
pca_scores <- as.data.frame(pca_mid$x)
# Extract PCA loadings (the contribution of each original variable to the PCs)
pca_loadings <- as.data.frame(pca_mid$rotation)
# Combine scores into one data frame for plotting
scores_df <- cbind(pca_scores, df_mid[, c("segment", "language", "vowel")])
# Reshape the loadings for visualization
loadings_df <- pca_loadings |>
mutate(variable = rownames(pca_loadings))
# Plotting the PCA biplot with facets by language
ggplot() +
geom_point(data = scores_df, aes(x = PC1, y = PC2, color = vowel, shape = vowel), size = 2, alpha = 0.5) + # Plot the PCA scores (observations)
geom_segment(data = loadings_df, aes(x = 0, y = 0, xend = PC1 * 5, yend = PC2 * 5), arrow = arrow(type = "closed", length = unit(0.1, "inches")), color = "red") + # Plot the loadings (variables)
geom_text(data = loadings_df, aes(x = PC1 * 5, y = PC2 * 5, label = variable), size = 4, color = "red") + # Add labels for the loadings
stat_ellipse(data = scores_df, aes(x = PC1, y = PC2, color = vowel, fill = vowel), geom = "polygon", alpha = 0.2, level = 0.95) +
facet_grid(segment ~ language) +
labs(title = "PCA biplot by language") +
scale_color_manual(values = cbPalette) +
scale_fill_manual(values = cbPalette) +
theme(legend.position = "bottom",
strip.text.y = element_text(angle = 360))Using PC scores as data
So far, we have identified four PCs, each of which explains more than
5% of variance in the data. At first glance, this doesn’t seem to be a
good way of data reduction, as we got four PC dimensions out of four
variables (f1z, f2z, f3z, and
duration).
However, we could also argue that data reduction has been indeed
achieved in the sense that some PCs capture joint contributions of
multiple parameters – e.g., PC1 showing covariation of f2z
and f3z. In other words, we have identified in a
data-driven manner that the two spectral parameters act together, and we
have compressed the two parameters into one PC dimension (i.e.,
PC1).
One of the advantages of using PCA is that it converts the data onto new scales in a data-driven manner. More specifically, PCA associates each data point with new numeric values called PC scores, showing variation of each data point along each PC dimensions. As a common practice, we can use PC scores as input to further data visualisation and statistical analysis.
Extracting PC scores
PC scores are stored as x in the PCA output. Let’s use
head() function to only display the first six rows as
otherwise the list would be too big.
## PC1 PC2 PC3 PC4
## [1,] -0.008988979 -0.3169801 1.77770054 -0.5644188
## [2,] 0.306799132 -0.2514384 1.55617832 -0.5399825
## [3,] -0.102459933 0.7019741 1.24214125 -0.9686917
## [4,] 0.210710158 -0.1331311 0.02665082 -1.1115056
## [5,] 0.125017636 -0.5062928 1.38864989 -1.1168562
## [6,] 0.222555474 -0.5372844 1.96536854 -0.4210074
Assuming that we haven’t made any changes to the order of rows, we can combine the PC scores with the existing data set. We have actually already done this when visualising the data, so I simply copy and paste the codes here:
# extract PCA scores (principal component scores for each observation)
pca_scores <- as.data.frame(pca_mid$x)
# check the number of rows
## pc score
nrow(pca_scores)## [1] 2306
## [1] 2306
Both pc_score and df_mid contain the same
number of rows, so we can merge them into one data frame.
Check PC scores
Before visualisation, let’s take a look at descriptive statistics of PC scores.
scores_df |>
dplyr::group_by(language, segment, vowel) |>
dplyr::summarise(
PC1_mean = mean(PC1),
PC1_sd = sd(PC1),
PC2_mean = mean(PC2),
PC2_sd = sd(PC2),
PC3_mean = mean(PC3),
PC3_sd = sd(PC3),
PC4_mean = mean(PC4),
PC4_sd = sd(PC4)
) |>
dplyr::ungroup()## `summarise()` has grouped output by 'language', 'segment'. You can override
## using the `.groups` argument.
## # A tibble: 12 × 11
## language segment vowel PC1_mean PC1_sd PC2_mean PC2_sd PC3_mean PC3_sd
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 English L /i/ 0.984 0.872 -0.360 0.984 -0.186 1.01
## 2 English L /u/ 0.929 0.840 -0.782 1.08 -0.297 0.785
## 3 English L /æ/ 0.806 0.632 0.00223 1.06 0.980 0.859
## 4 English R /i/ -0.539 0.770 -0.167 1.11 -0.380 0.748
## 5 English R /u/ -0.765 0.679 -0.285 1.07 -0.284 0.612
## 6 English R /æ/ -0.948 0.836 0.285 1.24 0.439 0.782
## 7 Japanese L /i/ 0.870 0.765 0.00313 0.715 -0.561 0.950
## 8 Japanese L /u/ 0.179 0.898 -0.464 0.896 -0.632 0.848
## 9 Japanese L /æ/ 0.525 0.856 0.311 0.936 0.591 0.984
## 10 Japanese R /i/ -0.141 1.03 0.242 0.823 -0.405 0.783
## 11 Japanese R /u/ -0.969 0.910 -0.178 0.800 -0.453 0.702
## 12 Japanese R /æ/ -0.742 1.08 0.399 1.02 0.426 0.792
## # ℹ 2 more variables: PC4_mean <dbl>, PC4_sd <dbl>
We get lots of numbers here – please do feel free to take a moment to digest these. But we could also visualise them to better facilitate the interpretation.
Visualising PCs
PC1
Let’s visualise PC scores to see if our assumptions about each PC is
correct. We think that PC1 captures covariation of f2z and
f3z, and this should correspond to the contrast between
English /l/ (lower F2, higher F3) and /ɹ/ (higher F2, lower F3).
## PC1 - by liquid consonant
scores_df |>
ggplot(aes(x = segment, y = PC1, colour = segment)) +
geom_jitter(width = 0.3) +
geom_violin(alpha = 0.3) +
geom_boxplot(width = 0.3, colour = "black") +
geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.3) +
scale_colour_manual(values = cbPalette) +
facet_grid(language ~ vowel) +
theme(
strip.text.y = element_text(angle = 360)
)This is more or less true, good! We could also make a between-group comparison.
## PC1 - by group
scores_df |>
ggplot(aes(x = language, y = PC1, colour = language)) +
geom_jitter(width = 0.3) +
geom_violin(alpha = 0.3) +
geom_boxplot(width = 0.3, colour = "black") +
geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.3) +
scale_colour_manual(values = cbPalette) +
facet_grid(segment ~ vowel) +
theme(
strip.text.y = element_text(angle = 360)
)We can see some patterns here: e.g., both L1 English and L1 Japanese speakers are similar for /l/, L1 Japanese speakers produce smaller PC1 scores in the /u/ context than L1 English speakers. Some between-group differences can be found for English /ɹ/ as well.
Your turn
In a similar manner to PC1, please visualise PC2, PC3 and PC4. Please try different visualisation methods to explore what each PC dimension highlights.
PC2
PC2 mainly captured covariation between duration and
f1z. This is a little less intuitive, but previous research
has indeed shown that L1 English and L1 Japanese speakers may exhibit
different F1 transition duration profiles. So, it would
be interesting to visualise PC2 to show between-group differences:
PC3
PC3 captures variation of f1z predominantly. So I would
assume that this captures variation of the vocalic context – e.g.,
higher F1 for low vowels.
Statistical analysis
As I said earlier, as a somewhat common practice, PC scores can serve as input to further statistical analysis. So far, we have found that PC1 captures the contrast between English /l/ and /ɹ/; although this is sort of obvious, the most important message here is that the data tells us about it without any apriori knowledge.
In the data visualisation above, it was somewhat difficult to
identify which of these variables: language,
vowel and segment, would have a significant
effect. Let’s construct a linear mixed effect model to formally test
this.
Your turn
Using knowledge from yesterday, please construct a linear mixed-effect model to predict z-noramlised PC1 values by:
languagesegmentan interaction between
languageandsegment
Hint: My hunch here is that by-speaker random effect would lead to a singular fit warning, meaning that it causes issues to the model estimate. One possible account might be that PC1z may not be showing much speaker-specific variation because we have normalised PC values that were based on spectral and duration values that had already been within-speaker normalised.
# converting variables into factor
scores_df <- scores_df |>
dplyr::mutate(
language = as.factor(language),
vowel = as.factor(vowel),
segment = as.factor(segment),
speaker = as.factor(speaker),
word = as.factor(word)
)
# run a full model
m1 <- lme4::lmer(PC1 ~ language + segment + vowel + language:segment + language:vowel + (1|speaker) + (1|word), data = scores_df, REML = FALSE)
# model summary
summary(m1)## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula:
## PC1 ~ language + segment + vowel + language:segment + language:vowel +
## (1 | speaker) + (1 | word)
## Data: scores_df
##
## AIC BIC logLik deviance df.resid
## 5929.4 5992.5 -2953.7 5907.4 2295
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.1466 -0.6199 -0.0194 0.5930 4.9442
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker (Intercept) 0.009092 0.09535
## word (Intercept) 0.005558 0.07455
## Residual 0.748043 0.86490
## Number of obs: 2306, groups: speaker, 45; word, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 0.74967 0.06889 10.882
## languageJapanese -0.28313 0.07487 -3.782
## segmentR -1.65190 0.06750 -24.471
## vowel/i/ 0.29449 0.07842 3.756
## vowel/u/ 0.15406 0.08607 1.790
## languageJapanese:segmentR 0.49938 0.07353 6.792
## languageJapanese:vowel/i/ 0.18394 0.08524 2.158
## languageJapanese:vowel/u/ -0.43883 0.09347 -4.695
##
## Correlation of Fixed Effects:
## (Intr) lnggJp sgmntR vowel/i/ vowel/u/ lngJ:R
## languagJpns -0.673
## segmentR -0.476 0.300
## vowel/i/ -0.537 0.336 -0.042
## vowel/u/ -0.496 0.312 -0.024 0.447
## lnggJpns:sR 0.299 -0.475 -0.638 0.039 0.022
## languageJapanese:vowel/i/ 0.336 -0.499 0.039 -0.642 -0.285 -0.057
## languageJapanese:vowel/u/ 0.313 -0.463 0.022 -0.285 -0.633 -0.035
## languageJapanese:vowel/i/
## languagJpns
## segmentR
## vowel/i/
## vowel/u/
## lnggJpns:sR
## languageJapanese:vowel/i/
## languageJapanese:vowel/u/ 0.425
# nested model 1 -- testing the language-segment interaction
m2 <- lme4::lmer(PC1 ~ language + segment + vowel + language:vowel + (1|speaker) + (1|word), data = scores_df, REML = FALSE)
# model comparison
anova(m1, m2, test = "Chisq")## Data: scores_df
## Models:
## m2: PC1 ~ language + segment + vowel + language:vowel + (1 | speaker) + (1 | word)
## m1: PC1 ~ language + segment + vowel + language:segment + language:vowel + (1 | speaker) + (1 | word)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m2 10 5973.0 6030.4 -2976.5 5953.0
## m1 11 5929.4 5992.5 -2953.7 5907.4 45.612 1 1.441e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# nested model 2 -- testing the language-vowel interaction
m3 <- lme4::lmer(PC1 ~ language + segment + vowel + language:segment + (1|speaker) + (1|word), data = scores_df, REML = FALSE)
# model comparison
anova(m1, m3, test = "Chisq")## Data: scores_df
## Models:
## m3: PC1 ~ language + segment + vowel + language:segment + (1 | speaker) + (1 | word)
## m1: PC1 ~ language + segment + vowel + language:segment + language:vowel + (1 | speaker) + (1 | word)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m3 9 5968.0 6019.7 -2975.0 5950.0
## m1 11 5929.4 5992.5 -2953.7 5907.4 42.679 2 5.399e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# post-hoc analysis
## between-group effects of each language
emmeans::emmeans(m1, pairwise ~ language | segment:vowel, adjust = "tukey")## $emmeans
## segment = L, vowel = /æ/:
## language emmean SE df lower.CL upper.CL
## English 0.750 0.0746 63.5 0.6007 0.8986
## Japanese 0.467 0.0647 37.0 0.3354 0.5977
##
## segment = R, vowel = /æ/:
## language emmean SE df lower.CL upper.CL
## English -0.902 0.0754 66.0 -1.0528 -0.7516
## Japanese -0.686 0.0651 37.4 -0.8178 -0.5542
##
## segment = L, vowel = /i/:
## language emmean SE df lower.CL upper.CL
## English 1.044 0.0769 68.1 0.8907 1.1976
## Japanese 0.945 0.0689 45.6 0.8062 1.0838
##
## segment = R, vowel = /i/:
## language emmean SE df lower.CL upper.CL
## English -0.608 0.0748 62.5 -0.7572 -0.4583
## Japanese -0.208 0.0672 42.5 -0.3432 -0.0719
##
## segment = L, vowel = /u/:
## language emmean SE df lower.CL upper.CL
## English 0.904 0.0858 61.9 0.7323 1.0752
## Japanese 0.182 0.0782 42.6 0.0239 0.3396
##
## segment = R, vowel = /u/:
## language emmean SE df lower.CL upper.CL
## English -0.748 0.0849 60.1 -0.9180 -0.5783
## Japanese -0.971 0.0772 40.9 -1.1267 -0.8148
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
##
## $contrasts
## segment = L, vowel = /æ/:
## contrast estimate SE df t.ratio p.value
## English - Japanese 0.2831 0.0756 216 3.745 0.0002
##
## segment = R, vowel = /æ/:
## contrast estimate SE df t.ratio p.value
## English - Japanese -0.2162 0.0768 230 -2.816 0.0053
##
## segment = L, vowel = /i/:
## contrast estimate SE df t.ratio p.value
## English - Japanese 0.0992 0.0814 268 1.218 0.2242
##
## segment = R, vowel = /i/:
## contrast estimate SE df t.ratio p.value
## English - Japanese -0.4002 0.0780 237 -5.129 <.0001
##
## segment = L, vowel = /u/:
## contrast estimate SE df t.ratio p.value
## English - Japanese 0.7220 0.0893 392 8.083 <.0001
##
## segment = R, vowel = /u/:
## contrast estimate SE df t.ratio p.value
## English - Japanese 0.2226 0.0876 371 2.541 0.0115
##
## Degrees-of-freedom method: kenward-roger
Summary
I hope I have shown that PCA is a strong tool to identify salient dimensions in the data in a bottom-up/data-driven manner. It projects the data onto a new dimension based on the identified principal components. Furthermore, we can obtain PC scores for each observation along the PC dimensions, and we can use them as input to further statistical analysis. The flexibility in the choice of statistical modelling is an advantage of PCA.
Summarising what we have found so far:
PCA here identifies four principal components (PCs) that explains 32.76% (PC1), 26.92% (PC2), 24.64% (PC3) and 15.68% (PC4) of the variance in the data.
The largest proportion of variance is explained by PC1 that captures a covariation of F2 and F3 as suggested by the biplot. The data visualisation clearly shows that English /l/ and /ɹ/ can be distinguished along PC1 (English /l/: higher PC1 values, English /ɹ/: lower PC1 values)
We can also use PC scores as data for further statistical analysis. Analysis with linear mixed-effect models shows that:
the interactions
language:vowelandlanguage:segmentare both statistically significantL1 Japanese and L1 English speakers differ in PC1 values at statistically significant level across all conditions but for English /l/ in the /i/ context.
Functional Principal Component Analysis (FPCA)
In the previous section, we performed principal component analysis (PCA) on the static data – i.e., the midpoint measurement of formant frequencies in the production of English /l/ and /ɹ/ by L1 Japanese and L1 English speakers. We have found that F2 and F3 seem to be the key in understanding their production judging from their contributions to PC1.
Whereas we have identified what parameters would characterise the between-group difference, it does not tell us much about how the two speaker groups differ. More specifically, we now know that F2 and F3 are important to understand English /l/ and /ɹ/ productions better, our ultimate aim is to know how L1 Japanese speakers differ from L1 English speakers in producing English /l/ and /ɹ/. Let’s turn to dynamic analysis and explore how the two groups of speakers differ in the realisations of F2 and F3 over time.
We will use Functional Principal Component Analysis (FPCA) to explore salient dynamic properties in the data. The basic architecture is quite similar to the ordinary PCA that we have just seen with a couple of key differences as it deals with functional data.
Preliminaries
Installing/loading packages
For the dynamic analysis using FPCA, we are using
fdapace() package, so let’s install it here.
Data wrangling
Check data
We always start with inspecting the data set using
colnames().
## [1] "...1" "file" "speaker" "language"
## [5] "f0" "duration" "segment" "previous_sound"
## [9] "next_sound" "word" "f1" "f2"
## [13] "f3" "time" "IsApproximant" "IsAcoustic"
## [17] "note" "gender" "omit" "Barkf1"
## [21] "Barkf2" "Barkf3" "f2f1" "f3f2"
## [25] "Barkf2f1" "Barkf3f2" "position" "context"
## [29] "liquid" "input_file"
Omitting irrelavent columns
We’ll omit the columns we don’t need.
# Let's check the number of "approximant" tokens
df_dyn |>
dplyr::group_by(IsApproximant) |>
dplyr::summarise() |>
dplyr::ungroup()## # A tibble: 1 × 1
## IsApproximant
## <chr>
## 1 yes
# Let's check the number of tokens of good recording quality
df_dyn |>
dplyr::group_by(IsAcoustic) |>
dplyr::summarise() |>
dplyr::ungroup()## # A tibble: 1 × 1
## IsAcoustic
## <chr>
## 1 yes
## # A tibble: 1 × 1
## omit
## <dbl>
## 1 0
# Remove columns that we no longer need
df_dyn <- df_dyn |>
dplyr::select(-c(IsApproximant, IsAcoustic, omit, Barkf1, Barkf2, Barkf3, Barkf2f1, Barkf3f2, f2f1, f3f2))Let’s check the column names again.
## [1] "...1" "file" "speaker" "language"
## [5] "f0" "duration" "segment" "previous_sound"
## [9] "next_sound" "word" "f1" "f2"
## [13] "f3" "time" "note" "gender"
## [17] "position" "context" "liquid" "input_file"
Let’s also convert the context column into IPA symbols
for a more intuitive representation:
Checking the number of participants, tokens…
Let’s also obtain some descriptive statistics here. Note that we need to divide the number of rows by 11 to obtain the accurate number of tokens, as one token now has 11 time points.
# number of participants
df_dyn |>
dplyr::group_by(language) |>
dplyr::summarise(n = n_distinct(speaker)) |>
dplyr::ungroup()## # A tibble: 2 × 2
## language n
## <chr> <int>
## 1 English 14
## 2 Japanese 31
# number of tokens per segment
df_dyn |>
dplyr::group_by(segment) |>
dplyr::summarise(n = n()/11) |> # divide by 11 time points
dplyr::ungroup()## # A tibble: 6 × 2
## segment n
## <chr> <dbl>
## 1 LAE1 511
## 2 LIY1 408
## 3 LUW1 299
## 4 RAE1 502
## 5 RIY1 481
## 6 RUW1 314
Data visualisation
Scaling formant frequencies
Do you remember how to visualise the dynamic data? The basic procedure is the same as in the static analysis; We first apply z-score normalisation to the formant frequencies to make sure that formant values are comparable across speakers.
df_dyn <- df_dyn |>
dplyr::group_by(speaker) |> # tell R to do the following iteration per speaker
dplyr::mutate(
f1z = as.numeric(scale(f1)), # scale f1 into z-score
f2z = as.numeric(scale(f2)), # scale f2 into z-score
f3z = as.numeric(scale(f3)) # scale f3 into z-score
) |>
dplyr::ungroup() # don't forget ungroupingDescriptive statistics
Let’s check the mean and SD for both raw and normalised formant values: just see F1 for now. Note that the mean z-scores do not seem to look zero, but this is because computers are not very good at dealing with very small numbers (e.g., decimals) and some fluctuations occur in computing the values.
# check mean and sd of raw/scaled F1 values for each speaker
df_dyn |>
dplyr::group_by(speaker) |>
dplyr::summarise(
f1_mean = mean(f1),
f1_sd = sd(f1),
f1z_mean = mean(f1z),
f1z_sd = sd(f1z)
) |>
dplyr::ungroup() ## # A tibble: 45 × 5
## speaker f1_mean f1_sd f1z_mean f1z_sd
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2d57ke 468. 140. -9.05e-17 1
## 2 2drb3c 575. 243. 2.39e-16 1
## 3 2zy9tf 459. 228. -2.41e-16 1
## 4 3bcpyh 438. 142. -5.72e-18 1.00
## 5 3hsubn 537. 177. 6.11e-17 1
## 6 3pzrts 458. 195. -1.44e-18 1.00
## 7 4ps8zx 467. 172. 2.19e-16 1
## 8 54i2ks 453. 192. 1.43e-16 1.00
## 9 5jzj2h 412. 133. 2.49e-16 1.00
## 10 5upwe3 444. 189. -6.51e-17 1.00
## # ℹ 35 more rows
Visualisation
raw trajectories
Let’s visualise the raw data first:
# F2 - raw trajectories
df_dyn |>
ggplot(aes(x = time, y = f2z)) +
geom_point(aes(colour = language, group = file), width = 0.3, alpha = 0.4) +
geom_path(aes(colour = language, group = file), width = 0.3, alpha = 0.4) +
geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.5) +
scale_colour_manual(values = cbPalette) +
facet_grid(liquid ~ context) +
labs(x = "time", y = "F2 (z-normalised)", title = "time-varying change in F2 frequency") +
theme(strip.text.y = element_text(angle = 0))smooths
Let’s also try plotting smoothed trajectories:
# F2 - smooths
df_dyn |>
ggplot(aes(x = time, y = f2z)) +
# geom_point(aes(colour = language, group = file), width = 0.3, alpha = 0.1) +
# geom_path(aes(colour = language, group = file), width = 0.3, alpha = 0.1) +
geom_smooth(aes(colour = language, group = language), linewidth = 1.2, se = TRUE) +
geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.5) +
scale_colour_manual(values = cbPalette) +
facet_grid(liquid ~ context) +
labs(x = "time", y = "F2 (z-normalised)", title = "smoothed time-varying change in F2 frequency") +
theme(strip.text.y = element_text(angle = 0))## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Functional Principal Component Analysis (FPCA)
At this point, we tried fitting Generalised Additive Mixed-Effect Models (GAMMs) to investigate between-group differences in F2 dynamics. This was possible because we had some ideas of possible factors (e.g., vowel context, speaker groups). In a way, lots of decisions were made in a top-down manner.
We could also try another approach; a bottom-up approach. That is, does the data tell us anything about higher-order groupings? Do the vowel context/speaker group differences emerge from the data?
Here, let’s try Functional Principal Component Analysis (FPCA) as a bottom-up approach.
# IDs = token column; tVec = time column; yVec = variable column(s)
input_df <- fdapace::MakeFPCAInputs(IDs = df_dyn$file, tVec = df_dyn$time, yVec = df_dyn$f2z)
# Check if there's any issues with the data
fdapace::CheckData(input_df$Ly, input_df$Lt)
# No errors have been returned, so let's now run fPCA on the dynamic PC1 trajectory
df_dyn_fpca <- fdapace::FPCA(Ly = input_df$Ly, input_df$Lt, optns = list(plot = TRUE))Checking fpca results
Let’s look into the results here. summary(df_dyn_fpca)
tells you what attributes are stored in the FPCA data:
## Length Class Mode
## sigma2 1 -none- numeric
## lambda 4 -none- numeric
## phi 44 -none- numeric
## xiEst 10060 -none- numeric
## xiVar 2515 -none- list
## obsGrid 11 -none- numeric
## workGrid 11 -none- numeric
## mu 11 -none- numeric
## smoothedCov 121 -none- numeric
## FVE 1 -none- numeric
## cumFVE 4 -none- numeric
## fittedCov 121 -none- numeric
## fittedCorr 121 -none- numeric
## optns 35 -none- list
## bwMu 0 -none- NULL
## bwCov 0 -none- NULL
## inputData 2 -none- list
## selectK 1 -none- numeric
## criterionValue 1 -none- numeric
## timings 4 difftime numeric
Eigenvalues are stored in lambda. Similarly in the
ordinary PCA, this shows how much variance is explained by each FPCA.
This case, FPC1 explains quite a lot of variance in the data, so we
might only need to look at FPC1 to understand an overall trend in the F2
dynamics.
## [1] 73.431312 13.589246 4.373711 1.285172
The cumulative proportion of variance is stored in
cumFVE.
## [1] 0.7859692 0.9314212 0.9782350 0.9919908
And as we did for the oridinary PCA, we can calculate
cumFVE using eigenvalues (lambda). This manual
approach differs slightly from the output shown with
df_dyn_fpca$cumFVE reflecting slightly different
calculation approaches, but you can see that the overall trend is still
quite similar.
# calculating proportion of variance from eigenvalues
fpca_var_exp <- df_dyn_fpca$lambda / sum(df_dyn_fpca$lambda)
# compare this with cumFVE
fpca_var_exp## [1] 0.79231501 0.14662633 0.04719182 0.01386685
## [1] 0.7859692 0.9314212 0.9782350 0.9919908
The dynamic analysis introduces the time dimension.
df_dyn_fpca$workGrid gives you the time points at which
data are sampled.
## [1] 0 10 20 30 40 50 60 70 80 90 100
PC scores are stored in df_dyn_fpca$xiEst. Each row is 1
token, and each column corresponds to each FPC dimension.
## [,1] [,2] [,3] [,4]
## [1,] 1.3978292 5.260927 2.1324070 -2.52256410
## [2,] -1.5626684 5.595501 0.2863422 -1.84886236
## [3,] -0.6166366 4.291694 1.1572724 -1.35058648
## [4,] -5.5315817 2.679823 0.1415474 0.08740263
## [5,] 0.5428989 3.542994 3.8331163 -2.47430535
## [6,] -2.7433146 7.071773 -2.1509403 -1.88748022
Finally, you can get a mean curve directly by applying
GetMeanCurve and a scree plot with
CreateScreePlot in the fdapace package.
CreatePathPlot returns a plot showing individual
trajectories.
## $mu
## [1] -0.70831425 -0.56725924 -0.32739985 -0.04873545 0.18428169 0.31887341
## [7] 0.37257545 0.37071962 0.31025622 0.17889910 -0.08389668
##
## $workGrid
## [1] 0 10 20 30 40 50 60 70 80 90 100
##
## $bwMu
## NULL
##
## $optns
## $optns$userBwMu
## [1] 5
##
## $optns$methodBwMu
## [1] "Default"
##
## $optns$userBwCov
## [1] 10
##
## $optns$methodBwCov
## [1] "Default"
##
## $optns$kFoldMuCov
## [1] 10
##
## $optns$methodSelectK
## [1] "FVE"
##
## $optns$FVEthreshold
## [1] 0.99
##
## $optns$FVEfittedCov
## NULL
##
## $optns$fitEigenValues
## [1] FALSE
##
## $optns$maxK
## [1] 9
##
## $optns$dataType
## [1] "Dense"
##
## $optns$error
## [1] TRUE
##
## $optns$shrink
## [1] FALSE
##
## $optns$nRegGrid
## [1] 11
##
## $optns$rotationCut
## [1] 0.25 0.75
##
## $optns$methodXi
## [1] "IN"
##
## $optns$kernel
## [1] "epan"
##
## $optns$lean
## [1] FALSE
##
## $optns$diagnosticsPlot
## NULL
##
## $optns$plot
## [1] TRUE
##
## $optns$numBins
## NULL
##
## $optns$useBinnedCov
## [1] TRUE
##
## $optns$usergrid
## [1] FALSE
##
## $optns$yname
## [1] "Ly"
##
## $optns$methodRho
## [1] "vanilla"
##
## $optns$verbose
## [1] FALSE
##
## $optns$userMu
## NULL
##
## $optns$userCov
## NULL
##
## $optns$methodMuCovEst
## [1] "cross-sectional"
##
## $optns$userRho
## NULL
##
## $optns$userSigma2
## NULL
##
## $optns$outPercent
## [1] 0 1
##
## $optns$useBinnedData
## [1] "AUTO"
##
## $optns$useBW1SE
## [1] FALSE
##
## $optns$imputeScores
## [1] TRUE
# path plot
fdapace::CreatePathPlot(df_dyn_fpca, xlab = "normalised time", ylab = "F2 (z-normalised)")Understanding variation captured by FPCs
We have seen that our FPCA analysis identifies that FPC1 captures the majority of variation observed in the data. Let’s check the details of this. The code below is from Strycharczuk et al. (2024) – see the file “diphthongisation_paper.html” stored in the repository.
The code below lets you visualise what variation is captured in FPC1 by adding and subtracting standard deviation to/from the mean curve.
# function: get PC scores + return data frame with PCs for each token
get_pc_scores <- function(fpcaObj){
pcs <- data.frame(fpcaObj$xiEst)
token <- names(fpcaObj$inputData$Lt)
df <- cbind(token, pcs)
n_pcs <- length(fpcaObj$lambda) # get number of PCs
pc_names <- paste0("PC", 1:n_pcs) # create colnames for PCs
names(df) <- c("file", pc_names) # add colnames for token + PCs
return(df)
}
# get PC scores w/ token info
pc_df <- get_pc_scores(df_dyn_fpca)
# join PCs (dat) with selected cols from original data frame
## store meta info
meta <- df_dyn |>
select(speaker, gender, language, word, liquid, context, file)
## merge the list and meta data - unique(meta) because otherwise there would be lots of duplicates
dat <- left_join(pc_df, unique(meta), by = "file")
# function: define perturbation function (±Q = ±sd, k = PC number)
perturbation <- function(fpcaObj, Q, k){
Q * sqrt(fpcaObj$lambda[k]) * fpcaObj$phi[,k] + fpcaObj$mu
}
# function: create perturbation object with mean and ±Q sd as a data frame (for one PC only)
# can validate against fdapace::GetMeanCurve and fdapace::CreateModeOfVarPlot
perturbation_object <- function(fpcaObj, Q, k){
time <- fpcaObj$workGrid # grid of time values
mean <- fpcaObj$mu # mean trajectory
Qplus <- perturbation(fpcaObj, Q, k) # +Q sd
Qminus <- perturbation(fpcaObj, -Q, k) # -Q sd
df <- cbind(time, mean, Qplus, Qminus)
colnames(df) <- c("time", "mean", "Qplus", "Qminus")
df <- data.frame(df)
df$PC <- paste0("PC", k) # add PC colname
return(df)
}
# function: create perturbation data frame with mean and ±Q sd (for all PCs)
# to do: add ability to pass list of Q values for gradient perturbation function
get_perturbation <- function(fpcaObj, Q){
n_pcs <- length(fpcaObj$lambda)
k <- 1:n_pcs
df <- lapply(k, perturbation_object, fpcaObj=fpcaObj, Q=Q)
df <- dplyr::bind_rows(df) # unnest lists into single df
return(df)
}
# get mean trajectory and ±2 sd for all PCs
Q <- seq(from = -2, to = 2, by = 0.1)
pQ <- lapply(Q, get_perturbation, fpcaObj = df_dyn_fpca)
names(pQ) <- Q # name pQ lists using value of Q
pQ <- dplyr::bind_rows(pQ, .id = "Q") # collapse lists together
pQ$Q <- as.numeric(pQ$Q) # make 'Q' column numeric
# visualise variation along each FPC
pQ |>
dplyr::filter(PC %in% c('PC1','PC2', 'PC3', 'PC4')) |>
dplyr::mutate(
PC = case_when(
PC == "PC1" ~ "FPC1",
PC == "PC2" ~ "FPC2",
PC == "PC3" ~ "FPC3",
PC == "PC4" ~ "FPC4"
)
) |>
tidyr::pivot_longer(Qplus:Qminus, names_to = "Qsd", values_to = "value") |>
ggplot2::ggplot() +
aes(x = time, y = value, colour = Q, group = interaction(Q, Qsd)) +
geom_path() +
facet_wrap(~ PC, ncol = 2) +
scale_colour_gradient2(low = "#E69F00", mid = "#56B4E9", high = "#009E73", midpoint = 0)+
labs(x = "Time (normalised)", y = "Reconstructed F2 values", color = "PC score")Remember that FPC1 explains 79.23% of the variance in the data. FPC1 shows that the variation is larger towards the end of the interval, which seems to suggest that FPC1 captures the difference in the shape and height of F2 trajectories depending on the vowel context. FPC2, on the other hand, shows a greater variation closer to the onset of the interval, which might correspond to the difference between English /l/ and /ɹ/.
Reconstructing F2 trajectories based on FPCs
We can reconstruct the F2 trajectories using the information captured by each FPC to better understand the dimension captured by FPC1.
working out necessary parameters
Let’s first work out parameters required for reconstructing the
original PC1 trajectories. This includes: (1) mean curve, (2) time
points used to fit FPCA, (3) eigenfunction associated with each time
point and (4) PC loadings for each PC. These are stored in the
fdapace::FDA object that we obtained from the FPCA analysis
earlier. The following code computes eigenfunctions manually.
# mean fPC1 trajectory
# pc1_mean_curve <- fdapace::GetMeanCurve(Ly = input.PC1$Ly, Lt = input.PC1$Lt, optns = list(plot = TRUE))
mu_values <- data.frame(df_dyn_fpca$mu) # mean curve values
mu_time <- data.frame(df_dyn_fpca$workGrid) # timepoints used for estimating the curve
phi <- data.frame(df_dyn_fpca$phi) # eigenfunction at each timepoint: workGrid * nlambda (e.g., 255 = 51 workGrid * 5 lambda)
lambda <- data.frame(df_dyn_fpca$lambda) # PC loadings for each PC: currently 5
# create a data frame containing mean curve, time and eigenfunctions assocaited with each PC at each time point
## add an extra column 'col_number' as a common index across the data frames - useful when merging everything together later on
### mean curve
mu_values <- mu_values |>
dplyr::mutate(
col_number = row_number()
)
### sampling time points
mu_time <- mu_time |>
dplyr::mutate(
col_number = row_number()
)
### eigenfunction
phi <- phi |>
dplyr::mutate(
col_number = row_number()
)
### pc loadings
lambda <- lambda |>
dplyr::mutate(
PC = str_c("PC", row_number()),
PC = str_c(PC, "lambda", sep = "_")
) |>
tidyr::pivot_wider(names_from = "PC", values_from = "df_dyn_fpca.lambda") |>
dplyr::slice(rep(1:n(), each = 51)) |>
dplyr::mutate(
col_number = row_number()
)
## merging all data together one by one
rec <- dplyr::left_join(mu_values, mu_time, by = "col_number")
rec <- dplyr::left_join(rec, phi, by = "col_number")
rec <- dplyr::left_join(rec, lambda, by = "col_number")
## tidying up some column names
rec <- rec |>
dplyr::select(col_number, df_dyn_fpca.workGrid, df_dyn_fpca.mu, X1, X2, X3, X4, PC1_lambda, PC2_lambda, PC3_lambda, PC4_lambda) |>
dplyr::rename(
mean = df_dyn_fpca.mu,
time = df_dyn_fpca.workGrid,
PC1_eigen = X1,
PC2_eigen = X2,
PC3_eigen = X3,
PC4_eigen = X4
)
## plotting the eigenfunctions - this should match with a sub-plot in bottom right created with plot(PC1)
rec |>
ggplot() +
# geom_path(aes(x = time, y = mean)) +
geom_path(aes(x = time, y = PC1_eigen), colour = "black", linewidth = 1.5) +
geom_path(aes(x = time, y = PC2_eigen), colour = "red", linetype = 2, linewidth = 1.5) +
geom_path(aes(x = time, y = PC3_eigen), colour = "darkgreen", linetype = 3, linewidth = 1.5) +
# geom_path(aes(x = time, y = value, colour = pc)) +
geom_hline(yintercept = 0) +
labs(x = "time", y = "eigenfunctions", title = "First 3 eigenfunctions")Let’s now merge the FPCA parameters with the meta data.
## [,1] [,2] [,3] [,4]
## [1,] 1.3978292 5.260927 2.1324070 -2.52256410
## [2,] -1.5626684 5.595501 0.2863422 -1.84886236
## [3,] -0.6166366 4.291694 1.1572724 -1.35058648
## [4,] -5.5315817 2.679823 0.1415474 0.08740263
## [5,] 0.5428989 3.542994 3.8331163 -2.47430535
## [6,] -2.7433146 7.071773 -2.1509403 -1.88748022
## file PC1 PC2 PC3 PC4 speaker
## 1 EN_5jzj2h_lamb015_0001 1.3978292 5.260927 2.1324070 -2.52256410 5jzj2h
## 2 EN_5jzj2h_lamb046_0001 -1.5626684 5.595501 0.2863422 -1.84886236 5jzj2h
## 3 EN_5jzj2h_lamb077_0001 -0.6166366 4.291694 1.1572724 -1.35058648 5jzj2h
## 4 EN_5jzj2h_lamb108_0001 -5.5315817 2.679823 0.1415474 0.08740263 5jzj2h
## 5 EN_5jzj2h_lamb139_0001 0.5428989 3.542994 3.8331163 -2.47430535 5jzj2h
## 6 EN_5jzj2h_lamp033_0001 -2.7433146 7.071773 -2.1509403 -1.88748022 5jzj2h
## gender language word liquid context
## 1 Male English lamb L /æ/
## 2 Male English lamb L /æ/
## 3 Male English lamb L /æ/
## 4 Male English lamb L /æ/
## 5 Male English lamb L /æ/
## 6 Male English lamp L /æ/
# duplicate each row by 51 times
dat_time <- dat |>
dplyr::slice(rep(1:n(), each = 51))
# add col_names to merge with the other data frame
dat_time <- dat_time |>
dplyr::group_by(file) |>
dplyr::mutate(
col_number = row_number()
) |>
ungroup()
# merge
dat_time <- left_join(dat_time, rec, by = "col_number")### Reconstructed individual F2 trajectories based on FPC1
Finally, here is the reconstructed F2 trajectories based on FPC scores for FPC1.
# reconstruct individual trajectory tokens
dat_time <- dat_time |>
dplyr::mutate(
PC1_reconstruct = PC1 * PC1_eigen + mean,
PC2_reconstruct = PC2 * PC2_eigen + mean,
PC3_reconstruct = PC3 * PC3_eigen + mean,
PC4_reconstruct = PC4 * PC4_eigen + mean
)
# visualise FPC1
dat_time |>
ggplot() +
geom_path(aes(x = time, y = PC1_reconstruct, group = file, colour = context), alpha = 0.2, show.legend = TRUE) +
scale_color_manual(values = c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")) +
labs(x = "Proportional Time (%)", y = "F2", title = "Reconstructed F2 from FPC1") +
geom_hline(yintercept = 0, linetype = 1, linewidth = 0.1) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
facet_grid(liquid ~ language)## Warning: Removed 100600 rows containing missing values or values outside the scale range
## (`geom_path()`).
It is indeed true that FPC1 seems to capture variation associated with vowel context. The visualisation suggests that L1 English speakers have a smaller variation especially for English /ɹ/. It is also quite evident that L1 Japanese speakers do not seem to differentiate F2 trajectories between /æ/ and /u/, which is rather surprising.
### Reconstructed individual F2 trajectories based on FPC2
What does F2 trajectory looks like when reconstructed from FPC2?
# visualise FPC2
dat_time |>
ggplot() +
geom_path(aes(x = time, y = PC2_reconstruct, group = file, colour = context), alpha = 0.2, show.legend = TRUE) +
scale_color_manual(values = c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")) +
labs(x = "Proportional Time (%)", y = "F2", title = "Reconstructed F2 from FPC2") +
geom_hline(yintercept = 0, linetype = 1, linewidth = 0.1) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
facet_grid(liquid ~ language)## Warning: Removed 100600 rows containing missing values or values outside the scale range
## (`geom_path()`).
OK, so there’s something interesting here: we speculated earlier that FPC2 may be capturing the difference between English /l/ and /ɹ/, but it doesn’t seem to be the case. Rather, FPC2 seems to capture the degree of vocalic anticipatory coarticulation that can be observed during the liquid interval.
### Reconstructed individual F2 trajectories based on FPC1+FPC2
Finally, it is also possible to account for a joint contribution of FPC1 and FPC2 by summing the values together:
# reconstruct individual trajectory tokens
dat_time <- dat_time |>
dplyr::mutate(
joint_PC1_PC2 = (PC1 * PC1_eigen) + (PC2 * PC2_eigen) + mean
)
# visualise FPC1+FPC2
dat_time |>
ggplot() +
geom_path(aes(x = time, y = joint_PC1_PC2, group = file, colour = context), alpha = 0.2, show.legend = TRUE) +
scale_color_manual(values = c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")) +
labs(x = "Proportional Time (%)", y = "F2", title = "Reconstructed F2 from FPC1+FPC2") +
geom_hline(yintercept = 0, linetype = 1, linewidth = 0.1) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
facet_grid(liquid ~ language)## Warning: Removed 100600 rows containing missing values or values outside the scale range
## (`geom_path()`).